Investigation of Wine Quality by T.Y. Chen
Some notes:
I used distinct colors (instead of a color gradient) for the different levels of “quality”, because I found the relationships easier to recognize that way; a color gradient is a bit hard for me to read. For the same reason, I used different line types instead of a color gradient.
Thanks for the suggestion. However, I found it easier to discuss all the graphs together at the end of each kind of graph, as some of the vars have quite similar properties.
What I’m going to do is first draw all possible plots for the dataset, then pick specific plots/vars to discuss further. I think this preserves the ability to look at all the vars together while keeping the report reader-friendly.
Thank you!
Univariate Plots Section
##        X          fixed.acidity  volatile.acidity   citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000
##  residual.sugar      chlorides    free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00
##  Median : 2.200   Median :0.07900   Median :14.00
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00
## total.sulfur.dioxide  density           pH           sulphates
##  Min.   :  6.00   Min.   :0.9901   Min.   :2.740   Min.   :0.3300
##  1st Qu.: 22.00   1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
##  Median : 38.00   Median :0.9968   Median :3.310   Median :0.6200
##  Mean   : 46.47   Mean   :0.9967   Mean   :3.311   Mean   :0.6581
##  3rd Qu.: 62.00   3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
##  Max.   :289.00   Max.   :1.0037   Max.   :4.010   Max.   :2.0000
##     alcohol         quality
##  Min.   : 8.40   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.20   Median :6.000
##  Mean   :10.42   Mean   :5.636
##  3rd Qu.:11.10   3rd Qu.:6.000
##  Max.   :14.90   Max.   :8.000
Univariate Analysis

graph summary: fixed.acidity
- Shape: approx. normal, skewed right; however, there are few observations around 10
- Center: around 6-8
- Spread: pretty wide; approx. 6-16
- Outliers: a few at 15-17

graph summary: volatile.acidity
- Shape: approx. uniformly distributed, with fewer observations toward the right side
- Center: around 0.3-0.7
- Spread: 0.2-1.2
- Outliers: a few at 1.6

graph summary: citric.acid
- Shape: quite irregular; however, the counts start to decrease from 0.5 on
- Center: hard to say; maybe around 0.3
- Spread: 0-0.75
- Outliers: at 1

graph summary: residual.sugar
- Shape: bell-shaped; right-skewed
- Center: approx. 1.5
- Spread: 0-6
- Outliers: at 8, 11, 13-15, 16

graph summary: chlorides
- Shape: similar to the last one; bell-shaped; right-skewed
- Center: approx. 0.1
- Spread: 0-0.2
- Outliers: at 0.2, 0.3-0.4, 0.6

graph summary: free.sulfur.dioxide
- Shape: as the value increases, the number of observations decreases; right-skewed
- Center: hard to say; approx. 15
- Spread: 0-50
- Outliers: at 65-70

graph summary: total.sulfur.dioxide
- Shape: as the value increases, the number of observations decreases; right-skewed
- Center: hard to say; approx. 50
- Spread: 0-150
- Outliers: at 300

graph summary: density
- Shape: normally distributed; skewed a bit left
- Center: 0.997
- Spread: 0.990-1.005
- Outliers: at 0.990

graph summary: pH
- Shape: normally distributed; pretty symmetric
- Center: 3.3
- Spread: 2.8-3.7
- Outliers: at 4

graph summary: sulphates
- Shape: approx. normally distributed; skewed right
- Center: 0.7
- Spread: 0.4-1.2
- Outliers: at 1.6, 2

graph summary: alcohol
- Shape: as the value increases, the number of observations decreases; right-skewed
- Center: hard to say; 12
- Spread: 9-14
- Outliers: at 15
summary
- most of the vars are approximately normally distributed (bell-shaped)
- sulphates, total sulfur dioxide, free sulfur dioxide, chlorides and residual sugar are all skewed right
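The right-skew reading can be cross-checked against the quartiles in the summary table at the top of this section. Below is a minimal sketch in Python (the report itself shows R output) that computes Bowley's quartile skewness, (Q3 + Q1 - 2·median) / (Q3 - Q1), from the printed quartiles. The helper name `quartile_skewness` is only illustrative, and this is a rough check, not the method used for the plots.

```python
# Quartiles (Q1, median, Q3) copied from the summary() output above.
quartiles = {
    "fixed.acidity":        (7.10, 7.90, 9.20),
    "residual.sugar":       (1.90, 2.20, 2.60),
    "chlorides":            (0.070, 0.079, 0.090),
    "free.sulfur.dioxide":  (7.0, 14.0, 21.0),
    "total.sulfur.dioxide": (22.0, 38.0, 62.0),
    "pH":                   (3.21, 3.31, 3.40),
    "sulphates":            (0.55, 0.62, 0.73),
}


def quartile_skewness(q1, median, q3):
    """Bowley's quartile skewness: > 0 suggests right skew, < 0 left skew."""
    return (q3 + q1 - 2 * median) / (q3 - q1)


skews = {name: quartile_skewness(*q) for name, q in quartiles.items()}
for name, s in skews.items():
    print(f"{name:22s} {s:+.2f}")
```

This measure only looks at the middle half of the data, so it misses long tails. For example, free.sulfur.dioxide has perfectly symmetric quartiles (7, 14, 21), yet its mean (15.87) is above its median (14.00) and its maximum is 72, which is where the right skew seen in the histogram comes from.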
What is the structure of your dataset?
1599 obs. of 12 vars (plus the row index X): 11 independent vars and 1 dependent var.
What is/are the main feature(s) of interest in your dataset?
“quality” is the main feature of interest. The other vars serve as predictors for it.
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
Four main categories:
- basic characteristics: pH, density, alcohol
- acidity: fixed.acidity, volatile.acidity, citric.acid
- (sulfur) dioxide: free.sulfur.dioxide, total.sulfur.dioxide
- other flavor: residual.sugar, chlorides
Did you create any new variables from existing variables in the dataset?
No.
Bivariate Plots Section
Bivariate Analysis
For the boxplot + jitter graph:
- quality “5” and “6” have lots of outliers in most vars
- claims 5 and 6 (regarding acidity) are further verified
- citric.acid gets higher as quality gets better
- residual.sugar stays about the same across all quality levels; claim 1 might be an effect of a few outliers in higher-quality wine. The same goes for chlorides
- “free.sulfur.dioxide” has the “bell-shaped” distribution stated in 3, while “total.sulfur.dioxide” is actually more “bell-shaped” than stated
- claim 4 (density, pH) is further verified
- sulphates actually gets higher as quality increases
- alcohol does go up as quality goes up; however, low-quality wine has relatively high alcohol
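The “outliers” in a boxplot are usually the points beyond the 1.5 × IQR fences. Below is a minimal Python sketch of that rule. The sample values are made up for illustration (they are not rows from the dataset), `boxplot_outliers` is a hypothetical helper name, and the quartile method may differ slightly from the hinges R's boxplot uses.

```python
import statistics


def boxplot_outliers(values, k=1.5):
    """Return the values outside the Tukey fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up sulphates-like values for one quality level (NOT from the dataset).
sample = [0.52, 0.55, 0.58, 0.60, 0.62, 0.65, 0.68, 0.73, 1.20, 1.95]
print(boxplot_outliers(sample))
```

Note that the summary table puts the first quartile of quality at 5 and the third at 6, so these two levels hold most of the wines. With more observations in a group, more points are likely to fall beyond the fences, so part of the “lots of outliers” for quality 5 and 6 may simply reflect group size.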
individual plots


graph summary
- relationship: sulphates gets higher as the quality increases
- distribution & outlier: most of the observations lie within the box range, but for quality 5 and 6 there are a lot of outliers


graph summary
- relationship: total.sulfur.dioxide gets higher as the quality increases, but the best wine has low total.sulfur.dioxide
- distribution & outlier: for wine quality 5 and 6, the values are concentrated at the lower end (smaller values); also, for quality 6, there are lots of outliers


graph summary
- relationship: bell-shaped
- distribution & outlier: quite even; however, for quality 5 and 6 the spread is quite wide

graph summary
- relationship: bell-shaped


graph summary
- relationship: bell-shaped
- distribution & outlier: the spread is narrow, which means the median is a good representation of the whole dataset; however, quality 5 and 6 have quite a few outliers

graph summary
- relationship: inverted-bell-shaped
- distribution & outlier: spread is quite wide

graph summary
- relationship: citric.acid gets higher as the quality increases
- distribution & outlier: spread is quite wide; however, for quality 5 and 6, lots of observations sit at the bottom (0)
summary
- “fixed.acidity”: has a weak link with quality; however, bad wine does have lower fixed.acidity
- “volatile.acidity”: gets lower when quality increases (strong)
- “citric.acid”: gets higher as quality gets better
- “residual.sugar”: (relatively) unrelated
- “chlorides”: (relatively) unrelated
- “free.sulfur.dioxide”: bell-shaped
- “total.sulfur.dioxide”: bell-shaped
- “density”: (relatively) unrelated
- “pH”: (relatively) unrelated
- “sulphates”: gets higher as quality increases
- “alcohol”: goes up as quality goes up; however, low-quality wine has relatively high alcohol
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
(using some of the graphs in the next section)
- positive linear:
  - “fixed.acidity”, “citric.acid” -> strong
  - “fixed.acidity”, “density” -> strong
- negative linear:
  - “fixed.acidity”, “pH” -> strong
  - “citric.acid”, “pH”
  - “density”, “alcohol”
- other:
  - “citric.acid”, “volatile.acidity”
  - “free.sulfur.dioxide”, “total.sulfur.dioxide” -> strong
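The “positive linear” and “negative linear” labels above come from reading the scatter plots; they can also be checked numerically with Pearson's correlation coefficient. Below is a minimal Python sketch. The two short lists are made-up values chosen only to illustrate a negative relationship (they are not rows from the dataset), and `pearson_r` is an illustrative helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up values (NOT from the dataset): acidity rising while pH falls.
fixed_acidity = [6.0, 7.0, 8.0, 9.0, 10.0, 12.0]
ph = [3.60, 3.50, 3.30, 3.25, 3.10, 3.00]

r = pearson_r(fixed_acidity, ph)
print(f"r = {r:+.3f}")  # a value near -1 indicates a strong negative linear link
```

A value near +1 or -1 would support a “strong” label, while a value near 0 only rules out a linear relationship, not a curved one such as the bell shapes noted earlier.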
What was the strongest relationship you found?
with quality:
“volatile.acidity”
with other vars:
- “fixed.acidity”, “citric.acid”
- “fixed.acidity”, “density”
- “fixed.acidity”, “pH”
- “free.sulfur.dioxide”,“total.sulfur.dioxide”
Multivariate Plots Section
Multivariate Analysis
note: the numbers are the mutual information of the respective pairs of vars

## [1] "the mutual value of"
## [1] "fixed.acidity"
## [1] "citric.acid"
## [1] 0.3263575

## [1] "the mutual value of"
## [1] "fixed.acidity"
## [1] "density"
## [1] 0.3268869

## [1] "the mutual value of"
## [1] "fixed.acidity"
## [1] "pH"
## [1] 0.3384619

## [1] "the mutual value of"
## [1] "citric.acid"
## [1] "fixed.acidity"
## [1] 0.3263575

## [1] "the mutual value of"
## [1] "free.sulfur.dioxide"
## [1] "total.sulfur.dioxide"
## [1] 0.355175

## [1] "the mutual value of"
## [1] "total.sulfur.dioxide"
## [1] "free.sulfur.dioxide"
## [1] 0.355175

## [1] "the mutual value of"
## [1] "density"
## [1] "fixed.acidity"
## [1] 0.3268869

## [1] "the mutual value of"
## [1] "pH"
## [1] "fixed.acidity"
## [1] 0.3384619
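The output above does not show how the mutual information was computed. One common approach, sketched below in Python under stated assumptions, is to discretize each continuous var into bins and then sum p(x,y)·log(p(x,y) / (p(x)·p(y))) over the observed bin pairs. The equal-width binning, the number of bins, the natural-log unit and the sample values are all assumptions for illustration, so the numbers will not reproduce the values printed above.

```python
import math
from collections import Counter


def equal_width_bins(values, n_bins):
    """Map each value to a bin index 0..n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid division by zero for constant input
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(xs, ys):
    """Mutual information (in nats) of two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


# Made-up values (NOT from the dataset), binned into 3 equal-width bins.
a = equal_width_bins([7.4, 7.8, 11.2, 7.9, 10.5, 6.7, 12.0, 8.1], 3)
b = equal_width_bins([3.51, 3.20, 3.16, 3.40, 3.10, 3.55, 3.05, 3.35], 3)
print(mutual_information(a, b), mutual_information(b, a))
```

Mutual information is symmetric, which is why each pair in the output above shows the same value in both orders (for example 0.3263575 for “fixed.acidity” with “citric.acid” and the reverse).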
quick summary
I found the relationships among “fixed.acidity”, “volatile.acidity”, “citric.acid”, “sulphates” and “alcohol” quite interesting, so I decided to look further into these plots.

graph summary
- the three lines do not overlap, which means wines of different quality do have different characteristics on these two vars
- however, the lines are rather flat, which means quality is primarily related to sulphates

graph summary
- the three lines do not overlap, which means wines of different quality do have different characteristics on these two vars
- the worst-quality wine has the most volatile.acidity; however, its relationship is a bit more complicated and is represented by the dotted line in the graph

graph summary
- the best-quality wine has either low alcohol and high fixed.acidity, or high alcohol and low fixed.acidity
- for medium- and low-quality wine, it’s rather random

graph summary
- the three lines do not overlap except when sulphates is larger than 1.5
- the best-quality wine has the lowest volatile.acidity

graph summary
- the best wine has high alcohol; for the others it’s more random

graph summary
- the worst wine has the highest volatile.acidity
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
I did two kinds of plotting:
- a scatter plot of every pair of vars, with each quality level in a different color
- scatter plots of selected vars, with a smooth line added for each quality level, each drawn in a different line type
Result:
Most of the vars don’t really strengthen each other when combined. We see a straight line in the graph, indicating that only one of the vars (instead of both) is influencing the result.
Were there any interesting or surprising interactions between features?
I found the interaction between alcohol and fixed.acidity particularly interesting. As fixed.acidity and alcohol both get higher, the quality of the wine increases.
Final Plots and Summary
Plot One

Description One
- The median (as well as the “box”) of citric.acid gets larger when the quality gets better, indicating a positive correlation between citric.acid and quality.
- However, the full range (the “line”) keeps the same length and position, indicating that some outlier observations are influencing the result.
- When quality is either 5 or 6, the distributions of the dots are quite similar, indicating that medium-quality wines are all quite similar.
Plot Two

Description Two
- We can see that the “box” and the median of volatile.acidity decrease as quality increases.
- Quality 7 and 8 have similar volatile.acidity.
- However, when quality is either 5 or 6, there are many outliers, which might cause problems: this group of data might not be properly represented in this graph.
Plot Three

Description Three
- The high-quality wine’s line lies mostly further to the right than the lines for low- and medium-quality wine, indicating that high-quality wine has an overall higher (alcohol + fixed.acidity)
- However, most wines are in the “low alcohol, low fixed.acidity” category, and are labeled green to indicate lower quality.
Reflection
When dealing with this dataset, I found two major problems particularly challenging:
1) How to use the right kind of graph to represent the data: there are tons of kinds of graphs out there, but some are quite hard to read while others might not represent the data accurately. Furthermore, even if a kind of graph is useful for evaluating certain kinds of data, I might not be familiar enough with it to use it well.
2) How to determine whether a particular relationship is “interesting” enough: while some of the relationships are strong enough to be identified, others are so vague that I don’t know whether to call them “interesting” or not.
Limits when dealing with this dataset:
- I relied on visual inspection alone (plus some help from the “smooth” line function, which plots a regression-style line on the graph), so the results might be too subjective
- In this dataset, it’s quite hard to find any complicated relationships, since all the vars are rather simple and don’t have any extra dimension (e.g. time) to them.
- Since I don’t really know wine, I don’t really know how to interpret the results.
Some suggestions:
- I think using machine learning to recognize relationships would be a more efficient and more objective approach. Since our dependent variable is an ordinal var, we can either 1) convert it to a binary var (e.g. high quality or not) and use classifiers such as a decision tree, or 2) try regression algos to predict the result
- the relationship between “alcohol” and “fixed.acidity” is quite interesting and worth further investigation.
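As a rough illustration of suggestion 1), the Python sketch below converts quality to a binary label and fits a one-split decision stump, the simplest form of a decision tree, by minimising Gini impurity. Everything here is an assumption for illustration: the cutoff of 7 for “high quality”, the helper names `gini` and `best_stump`, and the rows, which are made up and not taken from the dataset.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_stump(feature, labels):
    """Find the threshold t minimising weighted Gini for the split feature <= t."""
    best = None
    n = len(labels)
    for t in sorted(set(feature))[:-1]:  # every candidate except the maximum
        left = [lab for f, lab in zip(feature, labels) if f <= t]
        right = [lab for f, lab in zip(feature, labels) if f > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (t, score)
    return best


# Made-up (alcohol, quality) rows, NOT taken from the dataset.
rows = [(9.4, 5), (9.8, 5), (10.0, 6), (10.5, 6),
        (11.2, 7), (12.0, 7), (12.5, 8), (9.5, 4)]
HIGH_QUALITY_CUTOFF = 7  # assumption: "high quality" means quality >= 7

alcohol = [a for a, _ in rows]
is_high = [int(q >= HIGH_QUALITY_CUTOFF) for _, q in rows]

threshold, impurity = best_stump(alcohol, is_high)
print(f"split at alcohol <= {threshold}, weighted Gini = {impurity:.3f}")
```

A full decision tree would repeat this search over every var and recurse into each side of the split. On the real data, the result should be checked on held-out rows, since a single split that looks clean on a handful of points says little by itself.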